pip install ydata-profiling  # pandas-profiling was renamed to ydata-profiling
pip install sweetviz
import pandas as pd
df = pd.read_csv('dataset.csv')
df.head()
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
df.describe()
| | SeniorCitizen | tenure | MonthlyCharges |
---|---|---|---|
count | 7043.000000 | 7043.000000 | 7043.000000 |
mean | 0.162147 | 32.371149 | 64.761692 |
std | 0.368612 | 24.559481 | 30.090047 |
min | 0.000000 | 0.000000 | 18.250000 |
25% | 0.000000 | 9.000000 | 35.500000 |
50% | 0.000000 | 29.000000 | 70.350000 |
75% | 0.000000 | 55.000000 | 89.850000 |
max | 1.000000 | 72.000000 | 118.750000 |
# ISSUE: 'TotalCharges' is a numeric column, but it is missing from the summary statistics.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   gender            7043 non-null   object
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   object
 4   Dependents        7043 non-null   object
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   object
 7   MultipleLines     7043 non-null   object
 8   InternetService   7043 non-null   object
 9   OnlineSecurity    7043 non-null   object
 10  OnlineBackup      7043 non-null   object
 11  DeviceProtection  7043 non-null   object
 12  TechSupport       7043 non-null   object
 13  StreamingTV       7043 non-null   object
 14  StreamingMovies   7043 non-null   object
 15  Contract          7043 non-null   object
 16  PaperlessBilling  7043 non-null   object
 17  PaymentMethod     7043 non-null   object
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object
 20  Churn             7043 non-null   object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
# The dtype of the 'TotalCharges' column is incorrect: it is stored as object (string) instead of float.
df['TotalCharges'] = df['TotalCharges'].astype('float')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[174], line 1
----> 1 df['TotalCharges'] = df['TotalCharges'].astype('float')

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/generic.py:6643, in NDFrame.astype(self, dtype, copy, errors)
   6637     results = [
   6638         ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
   6639     ]
   6641 else:
   6642     # else, only a single dtype is given
-> 6643     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6644     res = self._constructor_from_mgr(new_data, axes=new_data.axes)
   6645     return res.__finalize__(self, method="astype")

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/internals/managers.py:430, in BaseBlockManager.astype(self, dtype, copy, errors)
    427 elif using_copy_on_write():
    428     copy = False
--> 430 return self.apply(
    431     "astype",
    432     dtype=dtype,
    433     copy=copy,
    434     errors=errors,
    435     using_cow=using_copy_on_write(),
    436 )

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    361     applied = b.apply(f, **kwargs)
    362 else:
--> 363     applied = getattr(b, f)(**kwargs)
    364 result_blocks = extend_blocks(applied, result_blocks)
    366 out = type(self).from_blocks(result_blocks, self.axes)

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/internals/blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
    755     raise ValueError("Can not squeeze with more than one column.")
    756 values = values[0, :]  # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    760 new_values = maybe_coerce_values(new_values)
    762 refs = None

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/dtypes/astype.py:237, in astype_array_safe(values, dtype, copy, errors)
    234 dtype = dtype.numpy_dtype
    236 try:
--> 237     new_values = astype_array(values, dtype, copy=copy)
    238 except (ValueError, TypeError):
    239     # e.g. _astype_nansafe can fail on object-dtype of strings
    240     #  trying to convert to float
    241     if errors == "ignore":

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/dtypes/astype.py:182, in astype_array(values, dtype, copy)
    179     values = values.astype(dtype, copy=copy)
    181 else:
--> 182     values = _astype_nansafe(values, dtype, copy=copy)
    184 # in pandas we don't store numpy str dtypes, so convert to object
    185 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File ~/Softwares/anaconda3/lib/python3.12/site-packages/pandas/core/dtypes/astype.py:133, in _astype_nansafe(arr, dtype, copy, skipna)
    129     raise ValueError(msg)
    131 if copy or arr.dtype == object or dtype == object:
    132     # Explicit copy, or required since NumPy can't view from / to object.
--> 133     return arr.astype(dtype, copy=True)
    135 return arr.astype(dtype, copy=copy)

ValueError: could not convert string to float: ' '
# The cast fails because some 'TotalCharges' values are blank (whitespace-only) strings.
empty_rows = df[df['TotalCharges'].str.strip() == '']
print(empty_rows)
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
488 | 4472-LVYGI | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | ... | Yes | Yes | Yes | No | Two year | Yes | Bank transfer (automatic) | 52.55 | | No |
753 | 3115-CZMZD | Male | 0 | No | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.25 | | No |
936 | 5709-LVOEQ | Female | 0 | Yes | Yes | 0 | Yes | No | DSL | Yes | ... | Yes | No | Yes | Yes | Two year | No | Mailed check | 80.85 | | No |
1082 | 4367-NUYAO | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.75 | | No |
1340 | 1371-DWPAZ | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | ... | Yes | Yes | Yes | No | Two year | No | Credit card (automatic) | 56.05 | | No |
3331 | 7644-OMVMY | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 19.85 | | No |
3826 | 3213-VVOLG | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.35 | | No |
4380 | 2520-SGTTA | Female | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.00 | | No |
5218 | 2923-ARZLG | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | One year | Yes | Mailed check | 19.70 | | No |
6670 | 4075-WKNIU | Female | 0 | Yes | Yes | 0 | Yes | Yes | DSL | No | ... | Yes | Yes | Yes | No | Two year | No | Mailed check | 73.35 | | No |
6754 | 2775-SEFEE | Male | 0 | No | Yes | 0 | Yes | Yes | DSL | Yes | ... | No | Yes | No | No | Two year | Yes | Bank transfer (automatic) | 61.90 | | No |
[11 rows x 21 columns]
# Found 11 rows where 'TotalCharges' is a blank string; all of them have tenure 0.
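The observation that the blank rows line up with zero tenure can be checked directly instead of by eye. The snippet below is a minimal sketch on a made-up four-row frame (the `toy` data is an assumption, not taken from the dataset); the same two lines should apply to `df` before the column is cast.

```python
import pandas as pd

# Toy stand-in for the real data: two blank 'TotalCharges' entries, both with tenure 0.
toy = pd.DataFrame({
    "tenure": [0, 34, 0, 2],
    "TotalCharges": [" ", "1889.5", " ", "108.15"],
})

# Boolean mask of rows whose 'TotalCharges' is empty after stripping whitespace.
blank = toy["TotalCharges"].str.strip() == ""

# True only if every blank row belongs to a customer with zero tenure.
all_zero_tenure = (toy.loc[blank, "tenure"] == 0).all()
print(blank.sum(), all_zero_tenure)
```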
import numpy as np
df['TotalCharges'] = df['TotalCharges'].replace(r'^\s*$', np.nan, regex=True)
# Replaced the empty data cells with NaN.
df['TotalCharges'] = df['TotalCharges'].astype('float')
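As an alternative to the regex replace followed by `astype`, `pd.to_numeric` with `errors="coerce"` does both steps in one call. This is a sketch on toy data, not the approach used in this notebook. Note the trade-off: coercion silently turns any unparseable value into NaN, not just blanks, so it is worth counting the NaNs afterwards.

```python
import pandas as pd

# Toy stand-in for the 'TotalCharges' column: numeric strings plus blank entries.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", ""]})

# errors="coerce" converts anything that cannot be parsed as a number into NaN
# instead of raising ValueError.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

n_missing = int(toy["TotalCharges"].isna().sum())
print(toy["TotalCharges"].dtype, n_missing)
```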
df.describe()
| | SeniorCitizen | tenure | MonthlyCharges | TotalCharges |
---|---|---|---|---|
count | 7043.000000 | 7043.000000 | 7043.000000 | 7032.000000 |
mean | 0.162147 | 32.371149 | 64.761692 | 2283.300441 |
std | 0.368612 | 24.559481 | 30.090047 | 2266.771362 |
min | 0.000000 | 0.000000 | 18.250000 | 18.800000 |
25% | 0.000000 | 9.000000 | 35.500000 | 401.450000 |
50% | 0.000000 | 29.000000 | 70.350000 | 1397.475000 |
75% | 0.000000 | 55.000000 | 89.850000 | 3794.737500 |
max | 1.000000 | 72.000000 | 118.750000 | 8684.800000 |
missing_values = df.isnull().sum()
missing_values
customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
df = df.dropna(subset=['TotalCharges'])
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 4))
sns.boxplot(x=df['tenure'])
plt.show()
plt.figure(figsize=(8, 4))
sns.boxplot(x=df['MonthlyCharges'])
plt.show()
plt.figure(figsize=(8, 4))
sns.boxplot(x=df['TotalCharges'])
plt.show()
Q1 = df['tenure'].quantile(0.25)
Q3 = df['tenure'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['tenure'] < lower_bound) | (df['tenure'] > upper_bound)]
print(outliers)
Q1 = df['MonthlyCharges'].quantile(0.25)
Q3 = df['MonthlyCharges'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['MonthlyCharges'] < lower_bound) | (df['MonthlyCharges'] > upper_bound)]
print(outliers)
Q1 = df['TotalCharges'].quantile(0.25)
Q3 = df['TotalCharges'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['TotalCharges'] < lower_bound) | (df['TotalCharges'] > upper_bound)]
print(outliers)
Empty DataFrame
Columns: [customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn]
Index: []

[0 rows x 21 columns]
Empty DataFrame
Columns: [customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn]
Index: []

[0 rows x 21 columns]
Empty DataFrame
Columns: [customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn]
Index: []

[0 rows x 21 columns]
# No outliers were detected in 'tenure', 'MonthlyCharges', or 'TotalCharges' using the 1.5 * IQR rule.
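The three IQR blocks above repeat the same steps, so they could be folded into one helper. The function name `iqr_outliers` and the toy frame are assumptions made for illustration; the logic mirrors the Q1 - 1.5 * IQR and Q3 + 1.5 * IQR bounds used above.

```python
import pandas as pd

def iqr_outliers(frame, column, k=1.5):
    """Return the rows of `frame` whose `column` value lies outside the IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[(frame[column] < lower) | (frame[column] > upper)]

# Toy data: 'tenure' contains one obvious extreme value, 'MonthlyCharges' none.
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5, 100],
    "MonthlyCharges": [20.0, 25.0, 30.0, 35.0, 40.0, 45.0],
})

# Count flagged rows per column instead of printing three full frames.
counts = {col: len(iqr_outliers(toy, col)) for col in toy.columns}
print(counts)
```

On the real data the same dictionary comprehension could be run over `['tenure', 'MonthlyCharges', 'TotalCharges']`.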
df['Partner'] = df['Partner'].map({'Yes': 1, 'No': 0})
df['gender'] = df['gender'].map({'Female': 1, 'Male': 0})
df['Dependents'] = df['Dependents'].map({'Yes': 1, 'No': 0})
df['PaperlessBilling'] = df['PaperlessBilling'].map({'Yes': 1, 'No': 0})
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
# Binary Encoding
df['No_Internet'] = df['InternetService'].map({'No': 1, 'DSL': 0, 'Fiber optic': 0})
df['No_Phone'] = df['PhoneService'].map({'No': 1, 'Yes': 0})
# Binary Encoding
df['OnlineSecurity'] = df['OnlineSecurity'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
df['OnlineBackup'] = df['OnlineBackup'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
df['DeviceProtection'] = df['DeviceProtection'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
df['TechSupport'] = df['TechSupport'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
df['StreamingTV'] = df['StreamingTV'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
df['StreamingMovies'] = df['StreamingMovies'].map({'Yes': 1, 'No': 0, 'No internet service': 0})
# Binary Encoding
df['MultipleLines'] = df['MultipleLines'].map({'Yes': 1, 'No': 0, 'No phone service': 0})
# Binary Encoding
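One thing to watch with `Series.map` and a dict: any value that is not a key in the dict becomes NaN without an error. The sketch below applies one shared mapping in a loop and fails loudly if a column contains an unexpected label. The toy frame and the name `service_map` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for two of the service columns.
toy = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No", "No internet service"],
    "TechSupport": ["No", "Yes", "No internet service"],
})

service_map = {"Yes": 1, "No": 0, "No internet service": 0}

for col in ["OnlineSecurity", "TechSupport"]:
    encoded = toy[col].map(service_map)
    # Series.map yields NaN for labels missing from the dict; catch that early.
    if encoded.isna().any():
        raise ValueError(f"unmapped values in column {col!r}")
    toy[col] = encoded.astype("int8")

print(toy)
```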
df = pd.get_dummies(df, columns=['InternetService', 'PaymentMethod', 'Contract'])
# One-Hot Encoding.
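In recent pandas versions `get_dummies` returns boolean columns by default, which is why the dummy columns show up as `bool` in the `info()` output further down. Passing `dtype` produces integer indicators directly. This is a small sketch on toy data, offered as an option rather than what the notebook does.

```python
import pandas as pd

# Toy stand-in for the 'Contract' column.
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

# dtype="int8" makes the indicator columns small integers instead of bool.
encoded = pd.get_dummies(toy, columns=["Contract"], dtype="int8")

print(encoded.dtypes)
```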
df.drop(columns=['PhoneService'], inplace=True)
df.drop(columns=['InternetService_No'], inplace=True)
# Dropping redundant columns: 'PhoneService' is the complement of 'No_Phone', and 'InternetService_No' duplicates 'No_Internet'.
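Before dropping a column as redundant, it can be worth confirming that it really carries no extra information. A minimal sketch on toy data (the frame is made up; the column names follow the ones used above):

```python
import pandas as pd

toy = pd.DataFrame({
    "InternetService": ["DSL", "No", "Fiber optic", "No"],
    "PhoneService": ["Yes", "No", "Yes", "Yes"],
})

# Flags built the same way as in the encoding step.
toy["No_Internet"] = toy["InternetService"].map({"No": 1, "DSL": 0, "Fiber optic": 0})
toy["No_Phone"] = toy["PhoneService"].map({"No": 1, "Yes": 0})

dummies = pd.get_dummies(toy["InternetService"], prefix="InternetService", dtype="int8")

# 'InternetService_No' should match 'No_Internet' row for row.
internet_duplicate = dummies["InternetService_No"].eq(toy["No_Internet"]).all()

# 'PhoneService' (as 1/0) and 'No_Phone' should always sum to 1.
phone_complement = (toy["PhoneService"].map({"Yes": 1, "No": 0}) + toy["No_Phone"]).eq(1).all()

print(internet_duplicate, phone_complement)
```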
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype
---  ------                                   --------------  -----
 0   customerID                               7032 non-null   object
 1   gender                                   7032 non-null   int64
 2   SeniorCitizen                            7032 non-null   int64
 3   Partner                                  7032 non-null   int64
 4   Dependents                               7032 non-null   int64
 5   tenure                                   7032 non-null   int64
 6   MultipleLines                            7032 non-null   int64
 7   OnlineSecurity                           7032 non-null   int64
 8   OnlineBackup                             7032 non-null   int64
 9   DeviceProtection                         7032 non-null   int64
 10  TechSupport                              7032 non-null   int64
 11  StreamingTV                              7032 non-null   int64
 12  StreamingMovies                          7032 non-null   int64
 13  PaperlessBilling                         7032 non-null   int64
 14  MonthlyCharges                           7032 non-null   float64
 15  TotalCharges                             7032 non-null   float64
 16  Churn                                    7032 non-null   int64
 17  No_Internet                              7032 non-null   int64
 18  No_Phone                                 7032 non-null   int64
 19  InternetService_DSL                      7032 non-null   bool
 20  InternetService_Fiber optic              7032 non-null   bool
 21  PaymentMethod_Bank transfer (automatic)  7032 non-null   bool
 22  PaymentMethod_Credit card (automatic)    7032 non-null   bool
 23  PaymentMethod_Electronic check           7032 non-null   bool
 24  PaymentMethod_Mailed check               7032 non-null   bool
 25  Contract_Month-to-month                  7032 non-null   bool
 26  Contract_One year                        7032 non-null   bool
 27  Contract_Two year                        7032 non-null   bool
dtypes: bool(9), float64(2), int64(16), object(1)
memory usage: 1.1+ MB
df['customerID'] = df['customerID'].astype(str)
dt_to_int = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'MultipleLines', 'OnlineSecurity',
'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
'PaperlessBilling', 'Churn', 'No_Internet', 'No_Phone',
'InternetService_DSL', 'InternetService_Fiber optic',
'PaymentMethod_Bank transfer (automatic)', 'PaymentMethod_Credit card (automatic)',
'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check', 'Contract_Month-to-month',
'Contract_One year', 'Contract_Two year']
dt_to_float = ['MonthlyCharges', 'TotalCharges']
df[dt_to_int] = df[dt_to_int].astype('int8')
df[dt_to_float] = df[dt_to_float].astype('float32')
# Standardizing The Datatypes.
# Reduced the size for memory efficiency.
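Downcasting to `int8` is only safe when every value fits in the range -128 to 127, and the saving can be measured with `memory_usage`. The sketch below checks the range first and compares byte counts on a toy frame; the data is an assumption chosen to resemble `tenure` and `Churn`.

```python
import numpy as np
import pandas as pd

# Toy integer columns stored as int64, like the columns before the cast.
toy = pd.DataFrame({"tenure": [1, 34, 72], "Churn": [0, 0, 1]}, dtype="int64")

# Range check: every column's min and max must fit inside int8.
limits = np.iinfo("int8")
fits = bool(((toy.min() >= limits.min) & (toy.max() <= limits.max)).all())

before = int(toy.memory_usage(index=False).sum())
small = toy.astype("int8") if fits else toy
after = int(small.memory_usage(index=False).sum())

print(fits, before, after)
```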
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype
---  ------                                   --------------  -----
 0   customerID                               7032 non-null   object
 1   gender                                   7032 non-null   int8
 2   SeniorCitizen                            7032 non-null   int8
 3   Partner                                  7032 non-null   int8
 4   Dependents                               7032 non-null   int8
 5   tenure                                   7032 non-null   int8
 6   MultipleLines                            7032 non-null   int8
 7   OnlineSecurity                           7032 non-null   int8
 8   OnlineBackup                             7032 non-null   int8
 9   DeviceProtection                         7032 non-null   int8
 10  TechSupport                              7032 non-null   int8
 11  StreamingTV                              7032 non-null   int8
 12  StreamingMovies                          7032 non-null   int8
 13  PaperlessBilling                         7032 non-null   int8
 14  MonthlyCharges                           7032 non-null   float32
 15  TotalCharges                             7032 non-null   float32
 16  Churn                                    7032 non-null   int8
 17  No_Internet                              7032 non-null   int8
 18  No_Phone                                 7032 non-null   int8
 19  InternetService_DSL                      7032 non-null   int8
 20  InternetService_Fiber optic              7032 non-null   int8
 21  PaymentMethod_Bank transfer (automatic)  7032 non-null   int8
 22  PaymentMethod_Credit card (automatic)    7032 non-null   int8
 23  PaymentMethod_Electronic check           7032 non-null   int8
 24  PaymentMethod_Mailed check               7032 non-null   int8
 25  Contract_Month-to-month                  7032 non-null   int8
 26  Contract_One year                        7032 non-null   int8
 27  Contract_Two year                        7032 non-null   int8
dtypes: float32(2), int8(25), object(1)
memory usage: 336.5+ KB
duplicate = df[df.duplicated()]
duplicate
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | MultipleLines | OnlineSecurity | OnlineBackup | DeviceProtection | ... | No_Phone | InternetService_DSL | InternetService_Fiber optic | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Contract_Month-to-month | Contract_One year | Contract_Two year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 28 columns
# customerID column:
# Checked for duplicates -> 0 were detected.
# It can't be dropped yet, as it could still contain hidden patterns that correlate with Churn.
# To check for correlations, it first has to be converted to numeric.
# There were two approaches for this:
# 1) Split the customerID into its numeric part and its letter part. This only works if the numeric part is unique,
#    but the numeric part turned out to have almost 2000 duplicates.
# 2) So I went with the second approach: encode the whole customerID as a number.
# After converting it to numeric using categorical encoding, I checked its correlation with the target feature Churn.
# The correlation is extremely close to 0, so I finally decided to drop the column.
duplicate_check = df['customerID'].duplicated().sum()
print(duplicate_check)
0
df['customer_prefix'] = df['customerID'].str[:4]
df['customer_suffix'] = df['customerID'].str[5:]
duplicate_check = df['customer_prefix'].duplicated().sum()
print(f"Number of duplicate prefixes: {duplicate_check}")
Number of duplicate prefixes: 1954
df['customerID_encoded'] = df['customerID'].astype('category').cat.codes
print(df[['customerID_encoded', 'Churn']].corr())
                    customerID_encoded     Churn
customerID_encoded            1.000000 -0.017858
Churn                        -0.017858  1.000000
df.drop(columns=['customerID', 'customerID_encoded', 'customer_prefix', 'customer_suffix'], inplace=True)
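To make the two customerID checks concrete, here is a small sketch on toy IDs. The ID `7590-ABCDE` is invented so that one prefix repeats; the others come from the `head()` output. One caveat worth keeping in mind: `cat.codes` numbers the categories in sorted order, so for a unique ID column the resulting correlation only measures whether churn relates to the alphabetical order of the IDs.

```python
import pandas as pd

toy = pd.DataFrame({
    # '7590-ABCDE' is a made-up ID added so that the prefix '7590' appears twice.
    "customerID": ["7590-VHVEG", "5575-GNVDE", "7590-ABCDE", "3668-QPYBK"],
    "Churn": [0, 0, 1, 1],
})

# Approach 1: the 4-digit prefix is only usable as a key if it is unique.
prefix = toy["customerID"].str[:4]
n_duplicate_prefixes = int(prefix.duplicated().sum())

# Approach 2: integer codes assigned in sorted order of the IDs.
codes = toy["customerID"].astype("category").cat.codes

print(n_duplicate_prefixes, codes.tolist())
```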
duplicate = df[df.duplicated()]
duplicate
| | gender | SeniorCitizen | Partner | Dependents | tenure | MultipleLines | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | ... | No_Phone | InternetService_DSL | InternetService_Fiber optic | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Contract_Month-to-month | Contract_One year | Contract_Two year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
964 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
1338 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
1491 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
1739 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
1932 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
2713 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
2892 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
3301 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
3754 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
4098 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
4476 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
5506 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
5736 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
5759 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
6267 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
6499 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
6518 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
6609 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
6706 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
6764 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
6774 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
6924 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
22 rows × 27 columns
df.duplicated().sum()
22
df = df.drop_duplicates()
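Once the unique ID is gone, rows from different customers can become identical, which is why duplicates appear only at this stage. By default `duplicated()` flags every repeat except the first occurrence, and `drop_duplicates()` keeps that first occurrence. A short sketch on toy data shows the difference between the default and `keep=False`:

```python
import pandas as pd

# Three identical rows and one distinct row.
toy = pd.DataFrame({"tenure": [1, 1, 1, 5], "Churn": [0, 0, 0, 1]})

# Default keep="first": the first of the identical rows is not counted.
n_repeats = int(toy.duplicated().sum())

# keep=False marks every member of a duplicate group.
n_in_groups = int(toy.duplicated(keep=False).sum())

deduped = toy.drop_duplicates()
print(n_repeats, n_in_groups, len(deduped))
```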
df['Churn'] = df.pop('Churn')
# Moved the 'Churn' column to the end.
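The `pop` idiom works because `pop` removes the column and returns it, and assigning a new column appends it on the right. A minimal sketch on a toy frame, with an equivalent non-mutating reorder shown for comparison:

```python
import pandas as pd

toy = pd.DataFrame({"Churn": [0, 1], "tenure": [1, 34], "MonthlyCharges": [29.85, 56.95]})

# pop() removes 'Churn' and returns it; the assignment re-adds it as the last column.
toy["Churn"] = toy.pop("Churn")
order_after_pop = list(toy.columns)

# Equivalent approach that builds a reordered frame instead of mutating in place.
reordered = toy[[c for c in toy.columns if c != "Churn"] + ["Churn"]]

print(order_after_pop)
```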
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')
| | gender | SeniorCitizen | Partner | Dependents | tenure | MultipleLines | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | PaperlessBilling | MonthlyCharges | TotalCharges | No_Internet | No_Phone | InternetService_DSL | InternetService_Fiber optic | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Contract_Month-to-month | Contract_One year | Contract_Two year | Churn |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gender | 1.000000 | 0.001069 | 0.000583 | -0.010912 | -0.006370 | 0.008199 | 0.015839 | 0.012523 | 0.000209 | 0.007996 | 0.006488 | 0.009471 | 0.011497 | 0.012361 | -0.000879 | -0.003164 | -0.007799 | -0.007607 | 0.009898 | 0.015566 | -0.002070 | -0.001452 | -0.011727 | 0.004008 | -0.008196 | 0.003146 | 0.008694 |
SeniorCitizen | 0.001069 | 1.000000 | 0.016030 | -0.211479 | 0.014456 | 0.142403 | -0.039258 | 0.066039 | 0.058881 | -0.061293 | 0.104830 | 0.119247 | 0.155922 | 0.219131 | 0.101642 | -0.181713 | -0.008724 | -0.108914 | 0.254556 | -0.016781 | -0.024909 | 0.170949 | -0.151840 | 0.138919 | -0.047053 | -0.116898 | 0.151270 |
Partner | 0.000583 | 0.016030 | 1.000000 | 0.451254 | 0.379564 | 0.140338 | 0.141722 | 0.139971 | 0.151709 | 0.118518 | 0.122387 | 0.115979 | -0.014856 | 0.095277 | 0.317021 | 0.002823 | -0.019420 | -0.002662 | 0.000212 | 0.110009 | 0.080889 | -0.083856 | -0.093854 | -0.278229 | 0.081661 | 0.246114 | -0.148670 |
Dependents | -0.010912 | -0.211479 | 0.451254 | 1.000000 | 0.161288 | -0.026103 | 0.079591 | 0.022187 | 0.012436 | 0.061825 | -0.018146 | -0.040073 | -0.110973 | -0.114641 | 0.062762 | 0.141100 | 0.000408 | 0.050589 | -0.165140 | 0.051341 | 0.060125 | -0.149862 | 0.059159 | -0.228311 | 0.068243 | 0.200783 | -0.162366 |
tenure | -0.006370 | 0.014456 | 0.379564 | 0.161288 | 1.000000 | 0.330194 | 0.326798 | 0.359445 | 0.359833 | 0.323761 | 0.278077 | 0.283212 | 0.003709 | 0.244194 | 0.825293 | -0.033641 | -0.009217 | 0.011691 | 0.016640 | 0.242424 | 0.231385 | -0.211583 | -0.228902 | -0.648215 | 0.200872 | 0.563273 | -0.353339 |
MultipleLines | 0.008199 | 0.142403 | 0.140338 | -0.026103 | 0.330194 | 1.000000 | 0.097065 | 0.200679 | 0.200186 | 0.098884 | 0.256229 | 0.257610 | 0.163462 | 0.490016 | 0.467641 | -0.209085 | -0.280776 | -0.202182 | 0.366462 | 0.074125 | 0.059004 | 0.083436 | -0.225640 | -0.086348 | -0.004982 | 0.105286 | 0.041888 |
OnlineSecurity | 0.015839 | -0.039258 | 0.141722 | 0.079591 | 0.326798 | 0.097065 | 1.000000 | 0.282253 | 0.273833 | 0.353636 | 0.174223 | 0.186144 | -0.004622 | 0.295398 | 0.411541 | -0.332233 | 0.091098 | 0.319812 | -0.031242 | 0.093412 | 0.114550 | -0.112793 | -0.077911 | -0.245518 | 0.099739 | 0.190796 | -0.170565 |
OnlineBackup | 0.012523 | 0.066039 | 0.139971 | 0.022187 | 0.359445 | 0.200679 | 0.282253 | 1.000000 | 0.301907 | 0.292679 | 0.280308 | 0.273207 | 0.126740 | 0.440711 | 0.509049 | -0.380417 | 0.051440 | 0.155840 | 0.165547 | 0.085844 | 0.089372 | -0.000672 | -0.172195 | -0.162680 | 0.083045 | 0.110258 | -0.081145 |
DeviceProtection | 0.000209 | 0.058881 | 0.151709 | 0.012436 | 0.359833 | 0.200186 | 0.273833 | 0.301907 | 1.000000 | 0.331883 | 0.388829 | 0.401228 | 0.103705 | 0.481919 | 0.521863 | -0.379578 | 0.069401 | 0.144206 | 0.175987 | 0.081946 | 0.110196 | -0.003622 | -0.185514 | -0.224408 | 0.101868 | 0.164188 | -0.064978 |
TechSupport | 0.007996 | -0.061293 | 0.118518 | 0.061825 | 0.323761 | 0.098884 | 0.353636 | 0.292679 | 0.331883 | 1.000000 | 0.276411 | 0.279014 | 0.037060 | 0.337361 | 0.431822 | -0.335128 | 0.094559 | 0.311633 | -0.021020 | 0.099516 | 0.116094 | -0.115315 | -0.082626 | -0.284226 | 0.095327 | 0.240071 | -0.163980 |
StreamingTV | 0.006488 | 0.104830 | 0.122387 | -0.018146 | 0.278077 | 0.256229 | 0.174223 | 0.280308 | 0.388829 | 0.276411 | 1.000000 | 0.532456 | 0.224151 | 0.629336 | 0.514548 | -0.414390 | 0.020595 | 0.013681 | 0.329704 | 0.044870 | 0.038762 | 0.144760 | -0.245983 | -0.110561 | 0.060738 | 0.070836 | 0.065058 |
StreamingMovies | 0.009471 | 0.119247 | 0.115979 | -0.040073 | 0.283212 | 0.257610 | 0.186144 | 0.273207 | 0.401228 | 0.279014 | 0.532456 | 1.000000 | 0.211456 | 0.626885 | 0.518704 | -0.417891 | 0.032697 | 0.024342 | 0.322398 | 0.047498 | 0.047152 | 0.137415 | -0.248553 | -0.115873 | 0.063583 | 0.074310 | 0.062670 |
PaperlessBilling | 0.011497 | 0.155922 | -0.014856 | -0.110973 | 0.003709 | 0.163462 | -0.004622 | 0.126740 | 0.103705 | 0.037060 | 0.224151 | 0.211456 | 1.000000 | 0.350900 | 0.157449 | -0.319082 | -0.017017 | -0.064091 | 0.325295 | -0.017976 | -0.014221 | 0.207569 | -0.202521 | 0.169603 | -0.052846 | -0.147104 | 0.190518 |
MonthlyCharges | 0.012361 | 0.219131 | 0.095277 | -0.114641 | 0.244194 | 0.490016 | 0.295398 | 0.440711 | 0.481919 | 0.337361 | 0.629336 | 0.626885 | 0.350900 | 1.000000 | 0.650540 | -0.762181 | -0.249625 | -0.163695 | 0.787169 | 0.040927 | 0.028552 | 0.269931 | -0.373324 | 0.061867 | 0.003271 | -0.075152 | 0.194008 |
TotalCharges | -0.000879 | 0.101642 | 0.317021 | 0.062762 | 0.825293 | 0.467641 | 0.411541 | 0.509049 | 0.521863 | 0.431822 | 0.514548 | 0.518704 | 0.157449 | 0.650540 | 1.000000 | -0.373655 | -0.114222 | -0.053986 | 0.360768 | 0.184837 | 0.181387 | -0.061060 | -0.292598 | -0.445223 | 0.169300 | 0.357016 | -0.198362 |
No_Internet | -0.003164 | -0.181713 | 0.002823 | 0.141100 | -0.033641 | -0.209085 | -0.332233 | -0.380417 | -0.379578 | -0.335128 | -0.414390 | -0.417891 | -0.319082 | -0.762181 | -0.373655 | 1.000000 | -0.171445 | -0.379098 | -0.464418 | 0.000606 | 0.003568 | -0.282854 | 0.315183 | -0.221836 | 0.039877 | 0.220282 | -0.228220 |
No_Phone | -0.007799 | -0.008724 | -0.019420 | 0.000408 | -0.009217 | -0.280776 | 0.091098 | 0.051440 | 0.069401 | 0.094559 | 0.020595 | 0.032697 | -0.017017 | -0.249625 | -0.114222 | -0.171445 | 1.000000 | 0.452245 | -0.290997 | -0.008821 | 0.006381 | -0.002890 | 0.005708 | 0.002172 | 0.002615 | -0.005022 | -0.011072 |
InternetService_DSL | -0.007607 | -0.108914 | -0.002662 | 0.050589 | 0.011691 | -0.202182 | 0.319812 | 0.155840 | 0.144206 | 0.311633 | 0.013681 | 0.024342 | -0.064091 | -0.163695 | -0.053986 | -0.379098 | 0.452245 | 1.000000 | -0.643450 | 0.023909 | 0.050418 | -0.105062 | 0.045291 | -0.063866 | 0.046507 | 0.030032 | -0.124152 |
InternetService_Fiber optic | 0.009898 | 0.254556 | 0.000212 | -0.165140 | 0.016640 | 0.366462 | -0.031242 | 0.165547 | 0.175987 | -0.021020 | 0.329704 | 0.322398 | 0.325295 | 0.787169 | 0.360768 | -0.464418 | -0.290997 | -0.643450 | 1.000000 | -0.023384 | -0.051204 | 0.334537 | -0.304077 | 0.244634 | -0.077498 | -0.210967 | 0.307612 |
PaymentMethod_Bank transfer (automatic) | 0.015566 | -0.016781 | 0.110009 | 0.051341 | 0.242424 | 0.074125 | 0.093412 | 0.085844 | 0.081946 | 0.099516 | 0.044870 | 0.047498 | -0.017976 | 0.040927 | 0.184837 | 0.000606 | -0.008821 | 0.023909 | -0.023384 | 1.000000 | -0.279541 | -0.378198 | -0.287391 | -0.178966 | 0.056822 | 0.154215 | -0.117442 |
PaymentMethod_Credit card (automatic) | -0.002070 | -0.024909 | 0.080889 | 0.060125 | 0.231385 | 0.059004 | 0.114550 | 0.089372 | 0.110196 | 0.116094 | 0.038762 | 0.047152 | -0.014221 | 0.028552 | 0.181387 | 0.003568 | 0.006381 | 0.050418 | -0.051204 | -0.279541 | 1.000000 | -0.374894 | -0.284881 | -0.203821 | 0.066798 | 0.173645 | -0.134052 |
PaymentMethod_Electronic check | -0.001452 | 0.170949 | -0.083856 | -0.149862 | -0.211583 | 0.083436 | -0.112793 | -0.000672 | -0.003622 | -0.115315 | 0.144760 | 0.137415 | 0.207569 | 0.269931 | -0.061060 | -0.282854 | -0.002890 | -0.105062 | 0.334537 | -0.378198 | -0.374894 | 1.000000 | -0.385422 | 0.332156 | -0.109966 | -0.281924 | 0.301079 |
PaymentMethod_Mailed check | -0.011727 | -0.151840 | -0.093854 | 0.059159 | -0.228902 | -0.225640 | -0.077911 | -0.172195 | -0.185514 | -0.082626 | -0.245983 | -0.248553 | -0.202521 | -0.373324 | -0.292598 | 0.315183 | 0.005708 | 0.045291 | -0.304077 | -0.287391 | -0.284881 | -0.385422 | 1.000000 | 0.002854 | 0.002128 | -0.005351 | -0.091649 |
Contract_Month-to-month | 0.004008 | 0.138919 | -0.278229 | -0.228311 | -0.648215 | -0.086348 | -0.245518 | -0.162680 | -0.224408 | -0.284226 | -0.110561 | -0.115873 | 0.169603 | 0.061867 | -0.445223 | -0.221836 | 0.002172 | -0.063866 | 0.244634 | -0.178966 | -0.203821 | 0.332156 | 0.002854 | 1.000000 | -0.569560 | -0.621445 | 0.404346 |
Contract_One year | -0.008196 | -0.047053 | 0.081661 | 0.068243 | 0.200872 | -0.004982 | 0.099739 | 0.083045 | 0.101868 | 0.095327 | 0.060738 | 0.063583 | -0.052846 | 0.003271 | 0.169300 | 0.039877 | 0.002615 | 0.046507 | -0.077498 | 0.056822 | 0.066798 | -0.109966 | 0.002128 | -0.569560 | 1.000000 | -0.290013 | -0.177742 |
Contract_Two year | 0.003146 | -0.116898 | 0.246114 | 0.200783 | 0.563273 | 0.105286 | 0.190796 | 0.110258 | 0.164188 | 0.240071 | 0.070836 | 0.074310 | -0.147104 | -0.075152 | 0.357016 | 0.220282 | -0.005022 | 0.030032 | -0.210967 | 0.154215 | 0.173645 | -0.281924 | -0.005351 | -0.621445 | -0.290013 | 1.000000 | -0.301375 |
Churn | 0.008694 | 0.151270 | -0.148670 | -0.162366 | -0.353339 | 0.041888 | -0.170565 | -0.081145 | -0.064978 | -0.163980 | 0.065058 | 0.062670 | 0.190518 | 0.194008 | -0.198362 | -0.228220 | -0.011072 | -0.124152 | 0.307612 | -0.117442 | -0.134052 | 0.301079 | -0.091649 | 0.404346 | -0.177742 | -0.301375 | 1.000000 |
feature_importance = corr["Churn"].abs().sort_values(ascending=False)
feature_importance
Churn                                      1.000000
Contract_Month-to-month                    0.404346
tenure                                     0.353339
InternetService_Fiber optic                0.307612
Contract_Two year                          0.301375
PaymentMethod_Electronic check             0.301079
No_Internet                                0.228220
TotalCharges                               0.198362
MonthlyCharges                             0.194008
PaperlessBilling                           0.190518
Contract_One year                          0.177742
OnlineSecurity                             0.170565
TechSupport                                0.163980
Dependents                                 0.162366
SeniorCitizen                              0.151270
Partner                                    0.148670
PaymentMethod_Credit card (automatic)      0.134052
InternetService_DSL                        0.124152
PaymentMethod_Bank transfer (automatic)    0.117442
PaymentMethod_Mailed check                 0.091649
OnlineBackup                               0.081145
StreamingTV                                0.065058
DeviceProtection                           0.064978
StreamingMovies                            0.062670
MultipleLines                              0.041888
No_Phone                                   0.011072
gender                                     0.008694
Name: Churn, dtype: float64
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
sns.heatmap(corr[["Churn"]].abs().sort_values(by="Churn", ascending=False), annot=True, cmap="coolwarm")
plt.title("Feature Importance")
plt.show()
threshold = 0.7
high_corr_pairs = corr.unstack().sort_values(ascending=False)
high_corr_pairs = high_corr_pairs[high_corr_pairs != 1]
high_corr_pairs = high_corr_pairs[abs(high_corr_pairs) > threshold]
print(high_corr_pairs)
TotalCharges                 tenure                         0.825293
tenure                       TotalCharges                   0.825293
InternetService_Fiber optic  MonthlyCharges                 0.787169
MonthlyCharges               InternetService_Fiber optic    0.787169
No_Internet                  MonthlyCharges                -0.762181
MonthlyCharges               No_Internet                   -0.762181
dtype: float64
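Each pair in the output above is listed twice because a correlation matrix is symmetric. A minimal sketch of one way to keep each pair once is to mask everything except the upper triangle before stacking. The helper name `unique_high_corr_pairs` and the toy frame are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def unique_high_corr_pairs(corr: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return each column pair once where |correlation| exceeds the threshold."""
    # True strictly above the diagonal: drops self-correlations and mirrored pairs.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    pairs = pairs[pairs.abs() > threshold]
    # Strongest relationships first, regardless of sign.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Hypothetical toy data: 'a' and 'b' are nearly collinear, 'c' is unrelated.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "b": [2, 4, 6, 8, 11],
                    "c": [5, 1, 4, 2, 3]})
toy_pairs = unique_high_corr_pairs(toy.corr(), threshold=0.7)
print(toy_pairs)
```

On the notebook's data the equivalent call would be `unique_high_corr_pairs(corr, threshold)`, which should list three pairs instead of six.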
high_corr_matrix = corr[abs(corr) > threshold]
high_corr_matrix = high_corr_matrix[high_corr_matrix != 1]
plt.figure(figsize=(15, 8))
sns.heatmap(high_corr_matrix, annot=True, cmap="coolwarm")
plt.title("Highly Correlated Features (Multicollinearity)")
plt.show()
from sklearn.model_selection import train_test_split
import sweetviz as sv
train_df, test_df = train_test_split(df, train_size=0.80)
compare = sv.compare([train_df, "Training Data"], [test_df, "Test Data"], "Churn")
compare.show_html('Compare.html')
Report Compare.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
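The split above passes no `random_state`, so the train and test sets differ on every run, and the churn rate in each part can drift. A minimal sketch of a reproducible, stratified variant follows. The toy frame, the column `x`, and the seed 42 are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 25% churn rate.
toy = pd.DataFrame({"x": range(100), "Churn": [1] * 25 + [0] * 75})

# stratify keeps the churn rate roughly equal in both parts;
# random_state makes the split repeatable across runs.
toy_train, toy_test = train_test_split(
    toy, train_size=0.80, random_state=42, stratify=toy["Churn"]
)

print(len(toy_train), len(toy_test))
print(toy_train["Churn"].mean(), toy_test["Churn"].mean())
```

With matching class proportions, differences that sweetviz reports between the two sets are less likely to be an artifact of the split itself.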
from ydata_profiling import ProfileReport
profile = ProfileReport(df)
profile.to_file("ProfileReport.html")
/Users/vibhavmisra/Softwares/anaconda3/lib/python3.12/site-packages/ydata_profiling/model/pandas/discretize_pandas.py:52: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '[9 0 0 ... 9 0 0]' has dtype incompatible with int8, please explicitly cast to a compatible dtype first.
  discretized_df.loc[:, column] = self._discretize_column(
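The FutureWarning shown for the profiling cell is raised inside ydata-profiling (`discretize_pandas.py`), not by the notebook's own code. One way to keep the cell output readable is to silence that category only around the noisy call, using the standard `warnings` module. In this sketch `noisy_step` is a hypothetical stand-in for the `profile.to_file(...)` call, so the example runs without the library installed.

```python
import warnings


def noisy_step() -> str:
    """Hypothetical stand-in for a library call that emits a FutureWarning."""
    warnings.warn("Setting an item of incompatible dtype is deprecated", FutureWarning)
    return "done"


# Filters changed inside the context manager are restored on exit,
# so FutureWarnings raised elsewhere in the notebook stay visible.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_step()

print(result)
```

Scoping the filter to one call is deliberate: a global `warnings.filterwarnings("ignore")` would also hide deprecation notices from the notebook's own pandas code.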